Abstract

We improve recently published results about resources of Restricted Boltzmann Machines (RBM) and Deep Belief Networks (DBN) required to make them Universal Approximators. We show that any distribution $p$ on the set $\{0,1\}^n$ of binary vectors of length $n$ can be arbitrarily well approximated by an RBM with $k-1$ hidden units, where $k$ is the minimal number of pairs of binary vectors differing in only one entry such that their union contains the support set of $p$. In important cases this number is half of the cardinality of the support set of $p$ (given in Le Roux & Bengio (2008)). We construct a DBN with $\frac{2^n}{2(n-b)}$, $b \sim \log n$, hidden layers of width $n$ that is capable of approximating any distribution on $\{0,1\}^n$ arbitrarily well. This confirms a conjecture presented in Le Roux & Bengio (2010).



1 Introduction 



This work rests upon ideas presented in Le Roux & Bengio (2008) and Le Roux & Bengio (2010). We positively resolve a conjecture that was posed in Le Roux & Bengio (2010).



Before going into the details of this conjecture we first recall some basic notions. 



* montufar@mis.mpg.de 



The definition of RBM's and DBN's that we use is the one given in the papers 
mentioned above and references therein. For details the reader is referred to those 
works. Here we give a short description: A Boltzmann Machine consists of a collection 
of binary stochastic units, where any pair of units may interact. The unit set is divided 
into visible and hidden units. Correspondingly the state is characterized by a pair $(v, h)$
where v denotes the state of the visible and h denotes the state of the hidden units. One is 
usually interested in distributions on the visible states v and would like to generate these 
as marginals of distributions on the states (v, h). In a general Boltzmann Machine the 
interaction graph is allowed to be complete. A Restricted Boltzmann Machine (RBM) 
is a special type of Boltzmann Machine, where the graph describing the interactions is 
bipartite: Only connections between visible and hidden units appear. It is not allowed 
that two visible units or two hidden units interact with each other (see Fig. 1). The
distribution over the states of all RBM units has the form of the Boltzmann distribution
$p(v, h) \propto \exp(h^\top W v + B \cdot v + C \cdot h)$, where $v$ is a binary vector of length equal to the
number of visible units, and h a binary vector with length equal to the number of hidden 
units. The parameters of the RBM are given by the matrix W and the two vectors B 
and C. A Deep Belief Network consists of a chain of layers of units. Only units from 
neighboring layers are allowed to be connected, there are no connections within each 
layer. The last two layers have undirected connections between them, while the other 
layers have connections directed towards the first layer, the visible layer. The general 
idea of a DBN is to assume that all layers are of similar size, as shown in Fig. 1.
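As a concrete illustration of the Boltzmann form described above, the following minimal sketch enumerates all joint states of a tiny RBM and sums out the hidden units. The function name `rbm_visible_marginal` is a hypothetical helper introduced only for this illustration, and we assume that $W$ has one row per hidden unit and one column per visible unit. Brute-force enumeration is only feasible for very small models.

```python
import itertools
import numpy as np

def rbm_visible_marginal(W, B, C):
    """Return {v: p(v)} for p(v, h) proportional to exp(h W v + B.v + C.h).

    W has shape (hidden units, visible units); B and C are the offset
    vectors of the visible and hidden units (illustrative convention).
    """
    m, n = W.shape
    visible_states = list(itertools.product((0, 1), repeat=n))
    hidden_states = list(itertools.product((0, 1), repeat=m))
    unnormalized = {}
    for v in visible_states:
        v_arr = np.array(v, dtype=float)
        total = 0.0
        for h in hidden_states:
            h_arr = np.array(h, dtype=float)
            # Boltzmann weight of the joint state (v, h)
            total += np.exp(h_arr @ W @ v_arr + B @ v_arr + C @ h_arr)
        unnormalized[v] = total
    Z = sum(unnormalized.values())  # partition function
    return {v: weight / Z for v, weight in unnormalized.items()}
```

With $W = 0$ the hidden units decouple and the marginal reduces to the factorizable distribution $p(v) \propto \exp(B \cdot v)$, which is the situation of an RBM without hidden units.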

A major difficulty in the use of Boltzmann Machines has always been the slowness of learning. In order to overcome this problem, DBN's have been proposed as an alternative to classical Boltzmann Machines. An efficient learning algorithm for DBN's was given in the paper Hinton et al. (2006).



The fundamental questions addressed in the above-mentioned previous work are the following: Does a DBN exist that is capable of approximating any distribution on the visible states through an appropriate choice of parameters? We will refer to such a DBN as a universal DBN approximator (similarly we will use the denomination universal RBM approximator). If universal DBN approximators exist, what is their minimal size?

Figure 1: On the left side we sketch the graph of interactions in an RBM, on the right side the corresponding graph for a DBN with $n = 4$ visible units (drawn brighter). An arbitrary weight can be assigned to every edge. Besides these connection weights, every node contains an individual offset weight. Every node takes value 0 or 1 with a probability that depends on the weights. The RBM and DBN of the sizes depicted above are examples of universal approximators of distributions on $\{0,1\}^4$ (Le Roux & Bengio (2008) and Le Roux & Bengio (2010) respectively). In the present paper it is shown that the number of hidden units in the RBM can be halved, and the number of hidden layers in the DBN can be roughly halved.

Since DBN's are more difficult to study than RBM's, as a preliminary step, corresponding questions related to the representational power of RBM's have been addressed. Theorem 2 in Le Roux & Bengio (2008) shows that any distribution on $\{0,1\}^n$ with support of cardinality $s$ is arbitrarily well approximated (with respect to the Kullback-Leibler divergence) by the marginal distribution of an RBM containing $s + 1$ hidden units:



Theorem 2 in Le Roux & Bengio (2008). Any distribution on $\{0,1\}^n$ can be approximated arbitrarily well with an RBM with $s + 1$ hidden units, where $s$ is the number of input vectors whose probability does not vanish.

This theorem proved the existence of a universal RBM approximator. The existence proof of a universal DBN approximator is due to Sutskever & Hinton (2008). More precisely, Sutskever & Hinton (2008) explicitly constructed a DBN with $\sim 3 \cdot 2^n$ hidden layers of width $n + 1$ that approximates any distribution on $\{0,1\}^n$. Given that the existence problem of universal DBN approximators was positively resolved through this result, efforts have been put into optimizing the size, i.e. reducing the number of parameters. This can be done by reducing the number of hidden layers involved in a DBN, or by making the hidden layers narrower. In terms of simple counting arguments, we give a lower bound on the minimal number of hidden layers required for the universality of a DBN with layers of size $n$. The number of free parameters in such a DBN is (square of the width of each layer) $\times$ (number of hidden layers) + (number of units), which for $k$ hidden layers is $k(n^2 + n) + n$. On the other hand, the number of parameters needed to describe all distributions on $2^n$ elements, e.g. over binary vectors of length $n$, is $2^n - 1$. Therefore, a lower bound on the number of hidden layers of a universal DBN approximator is given by $\frac{2^n - 1 - n}{n(n+1)}$ (which yields $2^n - 1$ free parameters). Otherwise the number of parameters would not be sufficient. Asymptotically, this bound is of order $\frac{2^n}{n^2}$. Certainly, since the architecture of DBN's imposes important restrictions on the way the parameters are used, such a lower bound is not necessarily achievable. In particular the approximation of a distribution through a DBN or RBM is not unambiguous, i.e. for several choices of the parameters the same distribution is produced as marginal distribution. However, in Le Roux & Bengio (2010) it has been shown that a number of hidden layers of order $\frac{2^n}{n}$ is sufficient:



Theorem 4 in Le Roux & Bengio (2010). If $n = 2^t$, a DBN composed of $\frac{2^n}{n} + 1$ layers of size $n$ is a universal approximator of distributions on $\{0,1\}^n$.
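To make the gap between the counting bound and Theorem 4 tangible, the following sketch evaluates both for small $n$. The function names are hypothetical and chosen only for this illustration; we also assume that the "$+1$" in Theorem 4 accounts for the visible layer, so that the theorem uses $2^n/n$ hidden layers.

```python
def dbn_parameter_count(n, k):
    """Free parameters of a DBN with k hidden layers of width n: k(n^2 + n) + n."""
    return k * (n * n + n) + n

def counting_lower_bound(n):
    """Smallest integer k with k(n^2 + n) + n >= 2^n - 1."""
    needed = 2 ** n - 1 - n
    return -(-needed // (n * (n + 1)))  # ceiling division on integers

def theorem4_hidden_layers(n):
    """Hidden layers used in Theorem 4 (n a power of two), assuming one visible layer."""
    return 2 ** n // n
```

For $n = 8$ the counting bound asks for at least 4 hidden layers, while Theorem 4 uses 32, which illustrates the difference between the orders $2^n/n^2$ and $2^n/n$.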



In the paper Le Roux & Bengio (2010) the optimality of the bound given in this theorem remains an open problem. However, their proof method suggests the sufficiency of fewer hidden layers, which was conjectured in their paper. The proof of Theorem 4 crucially depends on the authors' previous Theorem 2 in Le Roux & Bengio (2008). Our main contribution is to sharpen Theorem 2 (see Theorem 1 in Section 2), which allows us to exploit their method even better and thereby confirm their conjecture (see Theorem 3 in Section 2).

2 Results 



2.1 Restricted Boltzmann Machines 



The following Theorem 1 sharpens Theorem 2 in Le Roux & Bengio (2008). We will use it (its Corollary 2) in the proof of our main result, Theorem 3.

Theorem 1 (Reduced RBM's which are universal approximators). Any distribution $p$ on binary vectors of length $n$ can be approximated arbitrarily well by an RBM with $k - 1$ hidden units, where $k$ is the minimal number of pairs of binary vectors, such that the two vectors in each pair differ in only one entry, and such that the support set of $p$ is contained in the union of these pairs.

The set {0, 1}" corresponds to the vertex set of the n-dimensional cube. The edges 
of the n-dimensional cube correspond to pairs of binary vectors of length n which 
differ in exactly one entry. For the graph of the n-dimensional cube there exist perfect 
matchings, i.e., collections of disjoint edges which cover all vertices. Therefore we 
have the following: 

Corollary 2. Any distribution on $\{0,1\}^n$ can be approximated arbitrarily well by an
RBM with $\frac{2^n}{2} - 1$ hidden units.
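The perfect matching behind Corollary 2 can be written down explicitly. The sketch below uses one particular matching (pairing vectors that differ in the first entry); the function names are hypothetical and any other perfect matching of the cube would serve equally well.

```python
from itertools import product

def hypercube_perfect_matching(n):
    """Pair every vector whose first entry is 0 with the neighbour whose first entry is 1."""
    return [((0,) + rest, (1,) + rest) for rest in product((0, 1), repeat=n - 1)]

def hidden_units_for_cover(pairs):
    """Theorem 1 needs k - 1 hidden units for a cover of the support by k pairs."""
    return len(pairs) - 1
```

For $n = 3$ this yields 4 disjoint edges covering all 8 vertices, hence 3 hidden units.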

The proof of Theorem 1 given below is very much in the spirit of the proof of Theorem 2 in Le Roux & Bengio (2008). The idea there consists in showing that, given an RBM with some marginal visible distribution, the inclusion of an additional hidden unit allows one to increase the probability mass of one visible state vector, while uniformly reducing the probability mass of all other visible vectors.

We show that the inclusion of an additional hidden unit in fact allows one to increase the probability mass of a pair of visible vectors, in independent ratio, given that this pair differs in one entry. At the same time, the probability of all other visible states is reduced uniformly. We also use the offset weights in the visible units to further improve the result.



Proof of Theorem 1. We stay close to the notation used in Le Roux & Bengio (2008).

1. Let $p$ be the distribution on the states of visible and hidden units of an RBM. Its marginal probability distribution on $v$ can be written as

$$p(v) = \frac{\sum_h z(v, h)}{\sum_{v^0, h^0} z(v^0, h^0)}.$$

Denote by $p_{w,c}$ the distribution arising through the adding of a hidden unit to the RBM connected with weights $w = (w_1, \dots, w_n)$ to the visible units, and with offset weight $c$. Its marginal distribution can be written as

$$p_{w,c}(v) = \frac{(1 + \exp(w \cdot v + c)) \sum_h z(v, h)}{\sum_{v^0, h^0} (1 + \exp(w \cdot v^0 + c))\, z(v^0, h^0)}.$$

2. Given any vector $v \in \{0,1\}^n$ we write $v_{\hat j}$ for the vector defined through $(v_{\hat j})_i = v_i$, $\forall i \neq j$, and $(v_{\hat j})_j = 0$. We also write $\mathbf{1} := (1, \dots, 1)$, and $e_j := \mathbf{1} - \mathbf{1}_{\hat j}$.



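3. For any $j \in \{1, \dots, n\}$ let $\tilde v$ be an arbitrary vector with $\tilde v_j = 1$, and $s := |\{i \neq j : \tilde v_i = 1\}|$. Define

$$\hat w := a\left(\tilde v_{\hat j} - \tfrac{1}{2} \mathbf{1}_{\hat j}\right),$$
$$w := a\left(\tilde v_{\hat j} - \tfrac{1}{2} \mathbf{1}_{\hat j}\right) + (\lambda_2 - \lambda_1) e_j,$$
$$c := -\hat w \cdot \tilde v + \lambda_1 = -w \cdot \tilde v_{\hat j} + \lambda_1.$$

For the weights $w$ and $c$ we have:

$$w \cdot v = \tfrac{1}{2} a \left(s - |\{i : (v_{\hat j})_i \neq (\tilde v_{\hat j})_i\}|\right) + (\lambda_2 - \lambda_1) v_j,$$
$$c = -\tfrac{1}{2} a s + \lambda_1,$$

and in the limit $a \to \infty$ we get:

$$\lim_{a \to \infty} 1 + \exp(w \cdot v + c) = 1, \quad \forall v \neq \tilde v, \tilde v_{\hat j},$$
$$\lim_{a \to \infty} 1 + \exp(w \cdot \tilde v_{\hat j} + c) = 1 + e^{\lambda_1},$$
$$\lim_{a \to \infty} 1 + \exp(w \cdot \tilde v + c) = 1 + e^{\lambda_2}.$$

Just as in the proof of Theorem 2 in Le Roux & Bengio (2008) this yields for the marginal distribution on the visible states of the enlarged RBM the following:

$$\lim_{a \to \infty} p_{w,c}(v) = \frac{p(v)}{1 + e^{\lambda_1} p(\tilde v_{\hat j}) + e^{\lambda_2} p(\tilde v)}, \quad \forall v \neq \tilde v, \tilde v_{\hat j},$$
$$\lim_{a \to \infty} p_{w,c}(\tilde v_{\hat j}) = \frac{(1 + e^{\lambda_1})\, p(\tilde v_{\hat j})}{1 + e^{\lambda_1} p(\tilde v_{\hat j}) + e^{\lambda_2} p(\tilde v)},$$
$$\lim_{a \to \infty} p_{w,c}(\tilde v) = \frac{(1 + e^{\lambda_2})\, p(\tilde v)}{1 + e^{\lambda_1} p(\tilde v_{\hat j}) + e^{\lambda_2} p(\tilde v)}.$$

This means that the probability of $\tilde v$ and of $\tilde v_{\hat j}$ can be increased independently by a multiplicative factor, while all other probabilities are reduced uniformly.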

4. Now we explain how to start an induction from which the claim follows. Consider an RBM with no hidden units, RBM$^0$. Through a choice of the offset weights in every visible unit, RBM$^0$ produces as visible distribution any arbitrary factorizable distribution $p^0(v) \propto \exp(B \cdot v) \propto \exp(B \cdot v + K)$, where $B$ is the vector of offset weights and $K$ is a constant that we introduce for illustrative reasons, and is not a parameter of the RBM$^0$ since it cancels out with the normalization of $p^0$. In particular, RBM$^0$ can approximate arbitrarily well any distribution with support given by a pair of vectors that differ in only one entry. To see this consider any pair of vectors $\tilde v$ and $\tilde v_{\hat j}$ that differ in the entry $j$. Then, the choice $B = a(\tilde v_{\hat j} - \tfrac{1}{2} \mathbf{1}_{\hat j}) + (\lambda_2 - \lambda_1) e_j$ and $K = -a(\tilde v_{\hat j} - \tfrac{1}{2} \mathbf{1}_{\hat j}) \cdot \tilde v + \lambda_1$ yields in the limit $a \to \infty$ (similarly to the equations in item 3 above) that $\lim_{a \to \infty} p^0(v) = 0$ whenever $v \neq \tilde v$ and $v \neq \tilde v_{\hat j}$, while $\lim_{a \to \infty} p^0(\tilde v) / p^0(\tilde v_{\hat j}) = \exp(\lambda_2 - \lambda_1)$ can be chosen arbitrarily by modifying $\lambda_1$ and $\lambda_2$. Hence, $p^0$ can be made arbitrarily similar to any distribution with support $\{\tilde v, \tilde v_{\hat j}\}$. Notice that $p^0$ remains positive for all $v$ and $a < \infty$.

By the arguments described above, every additional hidden unit allows one to increase the probability of any pair of vectors which differ in one entry. Obviously, it is possible to do the same for a single vector instead of a pair. Hence, with every additional hidden unit the support set of the distributions which can be approximated arbitrarily well is enlarged by an arbitrary pair of vectors which differ in one entry. That is, RBM$^{i-1}$ is an approximator of distributions with support contained in any union of $i$ pairs of vectors which differ in exactly one entry. □
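The limit in item 3 of the proof can be checked numerically. The sketch below is our own illustration, not part of the original argument: it takes an arbitrary visible marginal $p$, attaches one hidden unit with the weights $w$ and offset $c$ of item 3 for a large but finite $a$, and compares the result with the stated limit. All function names are hypothetical, and the index `j` is 0-based in the code.

```python
import itertools
import numpy as np

def add_hidden_unit(p, w, c):
    """Visible marginal after adding a hidden unit: proportional to (1 + exp(w.v + c)) p(v)."""
    scaled = {v: (1.0 + np.exp(np.dot(w, v) + c)) * pv for v, pv in p.items()}
    Z = sum(scaled.values())
    return {v: s / Z for v, s in scaled.items()}

def pair_weights(v_tilde, j, a, lam1, lam2):
    """Weights w and offset c of item 3 for the pair (v_tilde, v_tilde with entry j set to 0)."""
    n = len(v_tilde)
    v_hat = np.array(v_tilde, dtype=float)
    v_hat[j] = 0.0                       # v_tilde with entry j set to 0
    ones_hat = np.ones(n)
    ones_hat[j] = 0.0                    # all-ones vector with entry j set to 0
    e_j = np.zeros(n)
    e_j[j] = 1.0
    w_hat = a * (v_hat - 0.5 * ones_hat)
    w = w_hat + (lam2 - lam1) * e_j
    c = -np.dot(w_hat, v_hat) + lam1     # equals -a*s/2 + lam1
    return w, c

def limit_marginal(p, v_tilde, j, lam1, lam2):
    """Stated limit of the enlarged marginal as a tends to infinity."""
    v_hat = tuple(0 if i == j else x for i, x in enumerate(v_tilde))
    denom = 1.0 + np.exp(lam1) * p[v_hat] + np.exp(lam2) * p[v_tilde]
    out = {v: pv / denom for v, pv in p.items()}
    out[v_hat] = (1.0 + np.exp(lam1)) * p[v_hat] / denom
    out[v_tilde] = (1.0 + np.exp(lam2)) * p[v_tilde] / denom
    return out
```

Calling `add_hidden_unit(p, *pair_weights(v_tilde, j, a, lam1, lam2))` with a large `a` should agree with `limit_marginal(p, v_tilde, j, lam1, lam2)` up to floating-point error, provided `v_tilde[j] == 1`.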

We close this passage with some remarks: 

The possibility of independently changing the probability mass of two visible vectors is due to the availability of the following two parameters: a) the offset weight of the added hidden unit, and b) the weight of the connection between the added hidden unit and the visible unit where the pair of visible vectors differ. See item 3 in the proof.

The attempt to use a similar idea to increase the probability mass of three different vectors in independent ratios induces a coupled change in the probability of a fourth vector. Three vectors differ in at least 2 entries, as do four vectors. Since only 3 parameters are available (the offset of the new hidden unit and two connection weights), the dependency arises.

It is worth noting that using exclusively a similar idea will not allow an extension of Theorem 2 in Le Roux & Bengio (2010) to permit the flip of a certain bit with a certain probability (only) given one of three input vectors.



2.2 Deep Belief Networks 



In this section we apply our Theorem 1 to modify the construction given in the proof of Theorem 4 in Le Roux & Bengio (2010) and prove our main result, Theorem 3.

Theorem 3 (Reduced DBN's which are universal approximators). Let $n = \frac{2^b}{2} + b$, $b \in \mathbb{N}$, $b \geq 1$. A DBN containing $\frac{2^n}{2(n-b)}$ hidden layers of width $n$ is a universal approximator of distributions on $\{0,1\}^n$.
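The admissible values of $n$ in Theorem 3 and the resulting number of hidden layers can be tabulated directly. This is a small sketch with a hypothetical function name; note that $2(n - b) = 2^b$, so the division is exact.

```python
def theorem3_sizes(b):
    """For an integer b >= 1 return (n, hidden_layers) with n = 2^b / 2 + b."""
    n = 2 ** (b - 1) + b
    hidden_layers = 2 ** n // (2 * (n - b))  # 2(n - b) = 2^b divides 2^n
    return n, hidden_layers
```

For example, $b = 3$ gives $n = 7$ and 16 hidden layers.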

Before proving Theorem 3 we first develop some components of the proof.

An important idea of Sutskever & Hinton (2008) is that of sharing, by means of which in a part of a DBN the probability of a vector is increased while the probability of another vector is decreased and the probability of all other vectors remains nearly constant. This idea is refined in Theorem 2 of Le Roux & Bengio (2010):



Theorem 2 in Le Roux & Bengio (2010) (slightly different formulation). Consider two layers of units indexed by $i \in \{1, \dots, n\}$ and $k \in \{1, \dots, n\}$, and denote by $v$ and $h$ state vectors in each layer. Denote by $\{w_{ik}\}_{i,k=1,\dots,n}$ the connection weights and by $\{c_k\}_{k=1,\dots,n}$ the offset weights in the second layer. Given any $l$ and $j$, $l \neq j$, let $a$ be an arbitrary vector in $\{0,1\}^n$ and $b$ another vector with $b_i = a_i$ $\forall i \neq j$, and $a_j \neq b_j$. Then, it is possible to choose weights $w_{k,l}$, $k \in \{1, \dots, n\}$, and $c_l$ such that the following equations are satisfied with arbitrary accuracy: $P(v_l = h_l \mid h) = 1$ $\forall h \notin \{a, b\}$, while $P(v_l = 1 \mid h = a) = p_a$ and $P(v_l = 1 \mid h = b) = p_b$ with arbitrary $p_a$, $p_b$.

By this Theorem, a sharing step can be accomplished in only one layer, whereby probability mass is transferred from a chosen vector to another vector differing in one entry. Furthermore, it demands adaptation only of the connection weights and offset weight of one single unit. Thereby, the overlay of a number of sharing steps in each layer is possible.



The main idea in Le Roux & Bengio (2010) was to exploit these circumstances using a clever sequence of transactions of probabilities. The requirements for the realizability of sharing sequences using Theorem 2 in Le Roux & Bengio (2010) can be summarized in properties of sequences of vectors. These properties are described in Theorem 3 of Le Roux & Bengio (2010), or in the items 2-3 of our appropriately modified version of that Theorem, Lemma 4 below.

How Theorem 2 in Le Roux & Bengio (2010) and Lemma 4 embrace the construction of a universal DBN approximator will become clearer in Lemma 5 further below.






Lemma 4. Let $n = \frac{2^b}{2} + b$, $b \in \mathbb{N}$, $b \geq 1$. There exist $a := 2^b = 2(n - b)$ sequences of binary vectors $S_i$, $0 \leq i \leq a - 1$, composed of vectors $S_{i,k}$, $1 \leq k \leq \frac{2^n}{a}$, satisfying the following:

1. $\{S_0, \dots, S_{a-1}\}$ is a partition of $\{0,1\}^n$.

2. $\forall i \in \{0, \dots, a - 1\}$, $\forall k \in \{1, \dots, \frac{2^n}{a} - 1\}$ we have $H(S_{i,k}, S_{i,k+1}) = 1$, where $H(\cdot, \cdot)$ denotes the Hamming distance.

3. $\forall i, j \in \{0, \dots, a - 1\}$ such that $i \neq j$ and $\forall k \in \{1, \dots, \frac{2^n}{a} - 1\}$ the bit switched between $S_{i,k}$ and $S_{i,k+1}$ and the bit switched between $S_{j,k}$ and $S_{j,k+1}$ are different, unless $H(S_{i,k}, S_{j,k}) = 1$.

Proof of Lemma 4. Let $G^0_{n-b}$ be any Gray code for $(n - b)$ bits. Such a Gray code is a matrix of size $2^{n-b} \times (n - b)$, where every two consecutive rows have Hamming distance one to each other, and the collection of all rows is $\{0,1\}^{n-b}$. Obviously any permutation of columns of this Gray code has the same properties. Let $G^i_{n-b}$ be the cyclic permutation of columns $i$ positions to the left.

Now define

$$S_i := \begin{pmatrix} \mathrm{bin}(i) & \\ \vdots & G^{\,i \bmod (n-b)}_{n-b} \\ \mathrm{bin}(i) & \end{pmatrix},$$

i.e. the first $b$ bits of the vector $S_{i,k}$ contain the $b$-bit binary representation of $i$. The rest of the bits contain the $k$-th row in the Gray code $G^0_{n-b}$ for arrays of length $n - b$ cyclically shifted $i$ positions to the left.

The cyclic permutation ensures that every two sequences of vectors $S_i$ and $S_j$, $i \neq j$, change the same bit in the same row (in this case they also do so in every row) only if the values of the first parts $\mathrm{bin}(i)$ and $\mathrm{bin}(j)$ of the two sequences differ in only one entry (in the first entry). □
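The construction in the proof can be carried out explicitly and the three properties of Lemma 4 checked by brute force for small $b$. The sketch below is our own illustration: it assumes the reflected binary Gray code as the choice of $G^0_{n-b}$ (the lemma allows any Gray code), writes $\mathrm{bin}(i)$ with the most significant bit first, and uses hypothetical function names.

```python
from itertools import product

def gray_code(m):
    """Rows of the reflected binary Gray code on m bits, as tuples of 0/1."""
    rows = []
    for k in range(2 ** m):
        g = k ^ (k >> 1)
        rows.append(tuple((g >> (m - 1 - pos)) & 1 for pos in range(m)))
    return rows

def rotate_left(row, i):
    """Cyclic shift of the entries of row by i positions to the left."""
    i %= len(row)
    return row[i:] + row[:i]

def build_sequences(b):
    """Return (n, [S_0, ..., S_{a-1}]) for n = 2^(b-1) + b and a = 2^b."""
    m = 2 ** (b - 1)                     # m = n - b
    n = m + b
    base = gray_code(m)
    sequences = []
    for i in range(2 ** b):
        prefix = tuple((i >> (b - 1 - pos)) & 1 for pos in range(b))  # bin(i)
        sequences.append([prefix + rotate_left(row, i) for row in base])
    return n, sequences

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def flipped_bit(u, v):
    """Position of the single entry in which u and v differ."""
    (pos,) = [t for t, (x, y) in enumerate(zip(u, v)) if x != y]
    return pos

def check_lemma4(b):
    """Return the truth values of properties 1, 2 and 3 of Lemma 4."""
    n, seqs = build_sequences(b)
    length = len(seqs[0])
    vectors = [v for s in seqs for v in s]
    prop1 = len(vectors) == 2 ** n and set(vectors) == set(product((0, 1), repeat=n))
    prop2 = all(hamming(s[k], s[k + 1]) == 1 for s in seqs for k in range(length - 1))
    prop3 = True
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            for k in range(length - 1):
                same_bit = (flipped_bit(seqs[i][k], seqs[i][k + 1])
                            == flipped_bit(seqs[j][k], seqs[j][k + 1]))
                if same_bit and hamming(seqs[i][k], seqs[j][k]) != 1:
                    prop3 = False
    return prop1, prop2, prop3
```

Two sequences flip the same position exactly when their shifts agree modulo $n - b$, i.e. when $j = i + 2^{b-1}$, in which case $\mathrm{bin}(i)$ and $\mathrm{bin}(j)$ differ only in the leading bit.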

Every two consecutive vectors in a sequence given in Lemma 4 differ in only one entry and this entry can be located in almost any position $\{1, \dots, n\}$. In contrast, for the sequences given in Theorem 3 of Le Roux & Bengio (2010) that entry can be located only in a subset of $\{1, \dots, n\}$ of cardinality $n/2$.

In the Lemma above, for any row, every one of $n - b$ entries is flipped by exactly two sequences. Observe that the attempt to produce $2n$ instead of $2(n - b)$ sequences with the properties 1-2 of the Lemma (and flips in all entries) would correspond to the following: Set

$$\begin{pmatrix} S_1 \\ \vdots \\ S_{2n} \end{pmatrix} = G_n,$$

i.e., the sequences to be overlayed are portions of the same Gray code. In this case it is difficult to achieve that condition 3 is satisfied, i.e., that if $S_i$ and $S_j$ flip the same bit in the same row, then $H(S_{i,k}, S_{j,k}) = 1$. Condition 3, however, is essential for the use of Theorem 2 of Le Roux & Bengio (2010). Most common Gray codes flip some entries more often than other entries and can be discarded. Other sequences referred to as totally balanced Gray codes flip all entries equally often and exist whenever $n$ is a power of 2, but still a strong cyclicity condition would be required. On account of this we say that the sequences given in Lemma 4 allow optimal use of Theorem 2 of Le Roux & Bengio (2010).



The following Lemma 5 is a transcription of Lemma 1 in Le Roux & Bengio (2010) with replacements of indices according to our construction. The proof is an obvious transcription which we omit here. Denote by $h^i$ a state vector of the units in the hidden layer $i$, and denote by $h^0$ a visible state.

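Lemma 5. Let $p^*$ be an arbitrary distribution on $\{0,1\}^n$. Consider a DBN with $\frac{2^n}{a} + 1$ layers and the following properties:

1. $\forall i \in \{0, \dots, a - 1\}$ the top RBM between $h^{\frac{2^n}{a}}$ and $h^{\frac{2^n}{a} - 1}$ assigns probability $\sum_k p^*(S_{i,k})$ to $S_{i,1}$.

2. $\forall i \in \{0, \dots, a - 1\}$, $\forall k \in \{1, \dots, \frac{2^n}{a} - 1\}$

$$P\left(h^{\frac{2^n}{a} - (k+1)} = S_{i,k+1} \,\middle|\, h^{\frac{2^n}{a} - k} = S_{i,k}\right) = \frac{\sum_{t=k+1}^{2^n/a} p^*(S_{i,t})}{\sum_{t=k}^{2^n/a} p^*(S_{i,t})},$$

$$P\left(h^{\frac{2^n}{a} - (k+1)} = S_{i,k} \,\middle|\, h^{\frac{2^n}{a} - k} = S_{i,k}\right) = \frac{p^*(S_{i,k})}{\sum_{t=k}^{2^n/a} p^*(S_{i,t})}.$$

3. $\forall k \in \{1, \dots, \frac{2^n}{a} - 1\}$ the DBN provides

$$P\left(h^{\frac{2^n}{a} - (k+1)} = v \,\middle|\, h^{\frac{2^n}{a} - k} = v\right) = 1, \quad \forall v \notin \bigcup_i \{S_{i,k}\}.$$

Such a DBN has $p^*$ as its marginal visible distribution.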
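The mechanism behind Lemma 5 can be illustrated on a single sequence $S_i$ with exact rational arithmetic. The sketch below assumes the sharing rule of Lemma 1 in Le Roux & Bengio (2010): the top RBM places the total mass of the sequence on $S_{i,1}$; in each subsequent layer the mass sitting on $S_{i,k}$ stays there with probability $p^*(S_{i,k}) / \sum_{t \geq k} p^*(S_{i,t})$ and otherwise moves to $S_{i,k+1}$; mass that has stayed once is never moved again. The function name is hypothetical.

```python
from fractions import Fraction

def propagate_sequence(masses):
    """masses[k] is the target probability of the (k+1)-th vector of one sequence.

    Returns the probabilities the vectors of the sequence carry in the
    visible layer after all sharing steps have been applied.
    """
    length = len(masses)
    tails = [sum(masses[k:], Fraction(0)) for k in range(length)]
    moving = tails[0]            # the top RBM puts the whole mass on the first vector
    final = []
    for k in range(length - 1):
        stay = masses[k] / tails[k] if tails[k] else Fraction(0)
        final.append(moving * stay)       # mass that remains on this vector for good
        moving = moving * (1 - stay)      # mass shared with the next vector
    final.append(moving)
    return final
```

The returned list coincides with the input, which is the telescoping argument showing that such a DBN reproduces $p^*$ on each sequence.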

We conclude this section with the proof of Theorem 3 and some remarks:

Proof of Theorem 3. The proof is analogous to the proof of Theorem 4 in Le Roux & Bengio (2010). We just need to show the existence of a DBN with the properties of the DBN described in Lemma 5. In view of Theorem 1 it is possible to achieve that the top RBM assigns arbitrary probability to the collection of vectors $S_{i,1}$, $i \in \{0, \dots, a - 1\}$, whenever it can be arranged in pairs of neighbouring vectors (or from Corollary 2 if all vectors are equal in a set of entries). This requirement is met for $S_{i,1}$, $i \in \{0, \dots, a - 1\}$, of Lemma 4 (e.g. choosing a Gray code whose first element is $(0, \dots, 0)$ or $(1, \dots, 1)$).

The subsequent layers are just like in the proof of Theorem 4 in Le Roux & Bengio (2010). They are possible in consideration of the maintained validity of Theorem 2 in Le Roux & Bengio (2010) using the sequences provided in Lemma 4 of the present paper. The only difference is that by our definition of $S_i$, $i \in \{0, \dots, a - 1\}$, at each layer $n - b$ bit flips (with correct probabilities) occur, instead of $n/2$. □



In the paper Le Roux & Bengio (2010) the authors overlayed $n$ sequences of sharing steps (Theorem 3 in that paper) for the construction of a universal DBN approximator. In principle an overlay of more such sequences is possible. This is what we exploit in our proof (the sequences given in Lemma 4). Apparently, the overlay of more sequences was not realized in that paper because for the initialization of these sequences (property 1 in Lemma 1 in that paper), the authors use Theorem 2 of Le Roux & Bengio (2008), which only allows one to assign arbitrary probability to $n$ vectors. Our result Theorem 1 overcomes this difficulty and allows us to initialize up to $2(n + 1)$ sequences, which we use to obtain property 1 in Lemma 5.

3 Conclusion 



We have shown that a Deep Belief Network (DBN) with $\frac{2^n}{2(n-b)}$, $b \sim \log n$, hidden layers of size $n$ is capable of approximating any distribution on $\{0,1\}^n$ arbitrarily well as its marginal visible distribution. (This confirms a conjecture presented in Le Roux & Bengio (2010).) The number of layers $\frac{2^n}{2(n-b)}$ is of order $\frac{2^n}{2n}$. This DBN has $\frac{2^n}{2(n-b)} n^2 + \frac{2^n}{2(n-b)} n + n$ parameters, which is of order $\frac{n 2^n}{2}$.

Furthermore, we have shown that a Restricted Boltzmann Machine (RBM) with $\frac{2^n}{2} - 1$ hidden units is capable of approximating any distribution on $\{0,1\}^n$ arbitrarily well as its marginal visible distribution. This RBM has $\frac{2^n}{2} n + \frac{2^n}{2} - 1$ parameters, which is of order $\frac{n 2^n}{2}$.



Our results improve all bounds known to date on the minimal size of universal DBN and RBM approximators. We still do not know if our results represent the minimal sufficient size for universal DBN and RBM approximators. Our construction already exploits Theorem 2 in Le Roux & Bengio (2010) exhaustively, and therefore a construction using only similar ideas will not allow improvements. However, we have performed numerical computations (we do not include details here) showing the existence of RBM's containing fewer than $\frac{2^n}{2} - 1$ hidden units which can approximate complex classes of distributions on $\{0,1\}^n$ arbitrarily well. This suggests that in the present construction the representational power of RBM's is not fully exploited. Whether further reductions of the size of a universal DBN approximator are possible is the subject of our ongoing research, Montufar (2010).